AUTHORS: Pallavi Ingale, Sanjay Nalbalwar
ABSTRACT: Supervised speech segregation for a cochannel speech signal is easier if models of predetermined speakers are used instead of models trained over the entire population. Here we propose a signal-to-signal ratio (SSR) independent method that detects speaker identities in a cochannel speech signal using unique speaker-specific features. The proposed Kekre’s Transform Cepstral Coefficient (KTCC) features are robust acoustic features for speaker identification. A text-independent speaker identification system identifies speakers in short segments of the test signal, with a Gaussian mixture model (GMM) classifier performing the identification task. We compare the proposed method with a system using conventional Mel Frequency Cepstral Coefficient (MFCC) features. For experimentation we use spontaneous speech utterances from the candidates, rather than the utterances of the speech separation challenge (SSC) corpus, which follow a command-like structure with a fixed grammar and a limited word list. Identification is performed on short segments of the cochannel mixture, and the two speakers identified in the majority of segments are taken as the two speakers detected for that mixture. With KTCC features, an average speaker detection accuracy of 93.56% is achieved for two-speaker cochannel mixtures. The method produces the best results for cochannel speaker identification despite being text independent. Speaker identification performance is also evaluated for various test segment lengths; KTCC features outperform in the speaker identification task even when the speech segments are very short.
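The segment-wise decision rule described above (the two speakers identified in the most segments win) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-segment labels are assumed to come from a GMM classifier scoring each short segment of the mixture, and the speaker names are placeholders.

```python
from collections import Counter

def detect_two_speakers(segment_ids):
    """Given the per-segment speaker identification decisions for one
    cochannel mixture, return the two speakers identified most often."""
    counts = Counter(segment_ids)
    return [speaker for speaker, _ in counts.most_common(2)]

# Hypothetical per-segment decisions from the GMM classifier:
decisions = ["A", "B", "A", "C", "B", "A"]
print(detect_two_speakers(decisions))  # → ['A', 'B']
```

A longer test signal yields more segments, and hence more votes, which is consistent with the abstract's observation that identification performance varies with test segment length.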
KEYWORDS: Detection of speaker identities, text independent speaker identification, cochannel speech, KTCC
REFERENCES:
[1] W. Yu, L. Jiajun, C. Ning, and Y. Wenhao, Improved monaural speech segregation based on computational auditory scene analysis, EURASIP Journal on Audio, Speech, and Music Processing, Vol. 2013, No.2, 2013, pp. 1-15.
[2] K. Hu, and D. Wang, An iterative model-based approach to cochannel speech separation, EURASIP Journal on Audio, Speech, and Music Processing, Vol. 2013, No.1, 2013, pp. 1-11.
[3] Y. Wang, and D. Wang, A structure-preserving training target for supervised speech separation, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6107-6111.
[4] A. Reddy and B. Raj, Soft mask methods for single-channel speaker separation, IEEE Transactions on Audio, Speech and Language Processing, Vol.15, No.6, 2007, pp. 1766-1776.
[5] G. Kim, Y. Lu, Y. Hu, and P. Loizou, An algorithm that improves speech intelligibility in noise for normal-hearing listeners, The Journal of the Acoustical Society of America, Vol. 126, No.3, 2009, pp. 1486-1494.
[6] Y. Shao, S. Srinivasan, Z. Jin, and D. Wang, A computational auditory scene analysis system for speech segregation and robust speech recognition, Computer Speech & Language, Vol.24, No.1, 2010, pp. 77-93.
[7] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T.T. Kristjansson, Super-human multi-talker speech recognition: A graphical modeling approach, Computer Speech & Language, Vol.24, No.1, 2010, pp. 45-66.
[8] P. Mowlaee, R. Saeidi, M. G. Christensen, Z. H. Tan, T. Kinnunen, P. Franti, and S. H. Jensen, A joint approach for single-channel speaker identification and speech separation, IEEE Transactions on Audio, Speech, and Language Processing, Vol.20, No.9, 2012, pp. 2586-2601.
[9] D. A. Reynolds, Speaker identification and verification using Gaussian mixture speaker models, Speech Communication, Vol.17, No.1, 1995, pp. 91-108.
[10] M. Cooke, J. R. Hershey, and S. J. Rennie, Monaural speech separation and recognition challenge, Computer Speech & Language, Vol. 24, No.1, 2010, pp. 1-15.
[11] X. Zhao, Y. Wang, and D. Wang, Deep neural networks for cochannel speaker identification, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4824-4828.
[12] J. M. Naik, Speaker Verification: A Tutorial, IEEE Communications Magazine, Vol. 28, No.1, 1990, pp. 42-48.
[13] H. B. Kekre, S. D. Thepade, and A. Maloo, Performance Comparison of Image Retrieval Using Fractional Coefficients of Transformed Image Using DCT, Walsh, Haar and Kekre’s Transform, CSC-International Journal of Image Processing (IJIP), Vol.4, No.2, 2010, pp. 142-155.
[14] D. A. Reynolds, and R. C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Transactions on Speech and Audio Processing, Vol.3, No.1, 1995, pp. 72-83.
[15] T. Giannakopoulos, A. Pikrakis, Introduction to Audio Analysis: A MATLAB® Approach. Academic Press, 2014.